Financial Contribution to Presidential Campaigns in California
Heshuang Zeng 07/01/2016
Summary
This report analyzes patterns of financial support for presidential candidates in California using an open dataset. We found that the Democratic candidates enjoy wide popularity in California, especially Bernard and Hillary. Although the number of Republican candidates is large, their collective support is less than that of either Bernard or Hillary.
Examining the support pattern by candidate within the Democratic party, we found that Bernard wins more than half of the total support count and is more popular in Northern California and among low-income contributors. Hillary is the second most popular candidate by support count. However, she is better received in the upper class and in high-income neighborhoods, since most of the big contributions (more than 1000 USD) go to her. Candidate support within a party also differs by supporters' income level, zipcode, and occupation.
The analysis indicates that the support pattern in California is closely related to income level, residential location, and occupation.
The Data Structure
The dataset records contributions from individuals in California to presidential candidates in the presidential election. It is a tabular dataset containing 548166 observations of 18 variables, dated from Jan 1, 2014 to April 30, 2016. Each observation is a transaction that a contributor made to support a candidate. Therefore, each transaction contains three parts of information:
- The contributor, including name, zipcode, city, employer, occupation, and state;
- The candidate, including the candidate's name, committee ID and candidate ID;
- Transaction details, including receipt date, receipt amount, receipt description, memo code, memo text, form type, transaction ID, file number and election type.
Exploring this rich dataset helps us understand the political landscape of the California presidential election.
Main interests of Exploratory Research
- Univariate. It is useful to understand the individual variables first. For example, the count of cand_name shows the popularity of candidates in California, and the distributions of support by zipcode, occupation, city and donation amount are useful as well.
- Bivariate and Multivariate. It is also possible to uncover support patterns, for example, the relationship between type of supporter and candidate by location or by occupation.
FALSE 'data.frame': 548166 obs. of 18 variables:
FALSE $ cmte_id : Factor w/ 25 levels "C00458844","C00500587",..: 7 7 7 7 7 7 7 7 7 7 ...
FALSE $ cand_id : Factor w/ 25 levels "cand_id","P00003392",..: 13 13 13 13 13 13 13 13 13 13 ...
FALSE $ cand_name : Factor w/ 25 levels "Bush, Jeb","cand_nm",..: 20 20 20 20 20 20 20 20 20 20 ...
FALSE $ contrb_nm : Factor w/ 100017 levels "_BOOTH, ELAINE S.",..: 4881 12668 12678 12681 12694 899 5959 13198 13198 13198 ...
FALSE $ contrb_city : Factor w/ 1464 levels "","*MORENO VALLEY",..: 1138 1175 721 570 1184 249 1072 1179 1179 1179 ...
FALSE $ contrb_st : Factor w/ 2 levels "CA","contbr_st": 1 1 1 1 1 1 1 1 1 1 ...
FALSE $ contrb_zip : Factor w/ 85151 levels "","00000","000090272",..: 53916 41095 1547 33840 11562 23445 66550 71648 71648 71648 ...
FALSE $ contrb_employer : Factor w/ 34055 levels ""," APPLE INC.",..: 4793 18837 26433 4109 11817 9836 20673 24075 24075 24075 ...
FALSE $ contrb_occupation: Factor w/ 15689 levels ""," REAL ESTATE BROKER",..: 11518 1989 15008 4683 11277 4683 9115 11855 11855 11855 ...
FALSE $ contb_receipt_amt: Factor w/ 5870 levels "-.44","-.76",..: 599 599 3760 2425 3675 3984 4925 4415 1650 4415 ...
FALSE $ contb_receipt_DT : Factor w/ 488 levels "01-APR-15","01-APR-16",..: 452 452 452 452 452 452 452 452 452 452 ...
FALSE $ receipt_desc : Factor w/ 66 levels "","* EARMARKED CONTRIBUTION: SEE BELOW REATTRIBUTION/REFUND PENDING",..: 1 1 1 1 1 1 1 1 1 1 ...
FALSE $ memo_cd : Factor w/ 3 levels "","memo_cd","X": 1 1 1 1 1 1 1 1 1 1 ...
FALSE $ memo_text : Factor w/ 250 levels "","*","* EARMARKED CONTRIBUTION: SEE BELOW",..: 3 3 3 3 1 3 3 3 3 3 ...
FALSE $ from_tp : Factor w/ 4 levels "form_tp","SA17A",..: 2 2 2 2 2 2 2 2 2 2 ...
FALSE $ file_num : Factor w/ 129 levels "1003942","1004025",..: 64 64 64 64 64 64 64 64 64 64 ...
FALSE $ tran_id : Factor w/ 545830 levels "A000771210424405B8CF",..: 353073 352207 347385 351166 323101 352521 348822 349557 349644 349911 ...
FALSE $ election_tp : Factor w/ 5 levels "","election_tp",..: 4 4 4 4 4 4 4 4 4 4 ...
Univariate Analysis
Key variables
- The key variables are transaction amount contb_receipt_amt, candidate name cand_name, contributors' location by city contrb_city or zipcode contrb_zip, and occupation contrb_occupation.
- Other features such as the party of candidates party, the date of the donation contb_receipt_DT, and the category of contribution by amount contrb_category might also be helpful.
Tidy and Clean the Data
Three changes are made to tidy and clean the data:
- Change the format of contb_receipt_DT from factor to date.
- Change the format of contb_receipt_amt from factor to character, then to numeric. There are some negative values, so in the analysis we had better use only the positive values.
- Extract the first five digits of contrb_zip using stringr to make the zip codes consistent. In the original file, the zip code data is inconsistent: some are nine digits and some are five.
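The three cleaning steps above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the code used for the report: the toy values are assumptions, and base R `substr` stands in for stringr (with stringr, `str_sub(x, 1, 5)` would be the equivalent call). The date format `"%d-%b-%y"` is inferred from levels such as "01-APR-15".

```r
# Month abbreviations are locale-dependent, so parse them in the C locale.
invisible(Sys.setlocale("LC_TIME", "C"))

# Toy rows mimicking the raw factor columns (illustrative values only).
raw <- data.frame(
  contb_receipt_amt = factor(c("25", "-50", "2700", "10.5")),
  contb_receipt_DT  = factor(c("01-APR-15", "15-MAR-16", "30-APR-16", "05-NOV-13")),
  contrb_zip        = factor(c("941101234", "90046", "956181111", "94114"))
)

# Factor -> character -> numeric. Calling as.numeric() on the factor directly
# would return the internal level codes instead of the amounts.
raw$contb_receipt_amt <- as.numeric(as.character(raw$contb_receipt_amt))

# Factor -> Date: day, abbreviated month name, two-digit year.
raw$contb_receipt_DT <- as.Date(as.character(raw$contb_receipt_DT),
                                format = "%d-%b-%y")

# Keep only the first five digits so nine-digit and five-digit zips agree.
raw$contrb_zip <- substr(as.character(raw$contrb_zip), 1, 5)

# Keep only the positive contributions for the analysis.
positive <- raw[raw$contb_receipt_amt > 0, ]
```

Converting through character first matters because the amounts were read in as a factor, as the structure output above shows.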
Create New Variables
Two new variables are created.
- I create a new variable called 'party' to help understand which party gains more popularity in California.
- I also create a variable called 'contrb_category', which includes five levels: negative, 0-50, 51-200, 201-1000, and more than 1000.
Univariate Plots Section
Clean and Create New Variables
Creating new variable ‘party’
Since computing the new field with a loop is too slow, I split the dataset into three subsets, add the party variable to each of them, and combine them again with rbind.
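As an alternative to both the loop and the split-and-rbind approach, the party can be assigned in a single vectorized step. The sketch below is only an illustration under assumptions: the party labels, the `democrats` and `republicans` lookup vectors, and the placeholder name "Some Other Candidate" are mine, and the lookup lists only candidates that appear in the top-10 table of this report.

```r
# Toy data; candidate names follow the spelling used in the dataset.
contrib <- data.frame(
  cand_name = c("Sanders, Bernard", "Clinton, Hillary Rodham",
                "Cruz, Rafael Edward 'Ted'", "Trump, Donald J.",
                "Some Other Candidate"),   # hypothetical placeholder
  stringsAsFactors = FALSE
)

# Hypothetical lookup vectors; extend them to cover all 25 candidates.
democrats   <- c("Sanders, Bernard", "Clinton, Hillary Rodham")
republicans <- c("Cruz, Rafael Edward 'Ted'", "Carson, Benjamin S.",
                 "Rubio, Marco", "Fiorina, Carly", "Paul, Rand",
                 "Bush, Jeb", "Kasich, John R.", "Trump, Donald J.")

# %in% and ifelse() work on the whole column at once, so no loop is needed.
contrib$party <- ifelse(contrib$cand_name %in% democrats, "democratic",
                 ifelse(contrib$cand_name %in% republicans, "republican",
                        "others"))
```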

Tidy the variable of contribution amount
Step 1. Turning it from factor to numeric
FALSE Min. 1st Qu. Median Mean 3rd Qu. Max.
FALSE -10000.00 15.00 27.44 135.20 100.00 10800.00
Step 2. Plot the distribution of the contribution amount
Create a new variable called ‘contrb_category’
This variable groups the contribution amount into five categories: negative, 0-50, 51-200, 201-1000, more_than_1000. We plot the number of contributions by category and find that most contributions are under 50 USD.
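A binning like this can be sketched with `cut()`. The exact bucket boundaries are an assumption on my part (for example, whether an amount of exactly 0 or 50.50 belongs to the lower or upper bucket), and the amounts below are toy values.

```r
# Toy amounts covering every bucket and the boundary cases.
amt <- c(-50, 0, 27.44, 50, 100, 1000, 2700)

bucket_labels <- c("negative", "0-50", "51-200", "201-1000", "more_than_1000")

# Right-closed intervals: (-Inf,0], (0,50], (50,200], (200,1000], (1000,Inf].
contrb_category <- cut(amt,
                       breaks = c(-Inf, 0, 50, 200, 1000, Inf),
                       labels = bucket_labels,
                       right  = TRUE)

# An amount of exactly 0 lands in the first interval; move it to "0-50"
# so that "negative" only holds amounts below zero.
contrb_category[amt == 0] <- "0-50"

# Number of contributions per bucket, as used for the bar chart.
counts <- table(contrb_category)
```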

[Plot: Count by Contribution Amount Bucket]
Clean the contrb_zip variable
Update the zipcode to keep only the first five digits, and find the top 10 neighborhoods by count.
FALSE
FALSE 94110 94114 94611 90046 94117 95060 94109 90049 90069 90405
FALSE 4506 3893 3234 3089 3031 2738 2349 2328 2325 2301
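A top-N count table like the one above can be produced by sorting a frequency table. The sketch uses a handful of toy zip codes and keeps the top 2 instead of the top 10.

```r
# Toy five-digit zip codes (illustrative only).
zips <- c("94110", "94110", "94114", "90046", "94110", "94114")

# table() counts each zip, sort() orders the counts, head() keeps the top n.
top_zip <- head(sort(table(zips), decreasing = TRUE), 2)
```

The same pattern applies to the candidate, city, occupation and employer counts in the following sections.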
Distribution of Key Variables
Candidates and Contribution Count
The top 10 candidates by popularity (Number of Supports)
FALSE
FALSE Sanders, Bernard Clinton, Hillary Rodham
FALSE 309233 125141
FALSE Cruz, Rafael Edward 'Ted' Carson, Benjamin S.
FALSE 53207 27337
FALSE Rubio, Marco Fiorina, Carly
FALSE 13755 4696
FALSE Paul, Rand Bush, Jeb
FALSE 4255 3109
FALSE Kasich, John R. Trump, Donald J.
FALSE 2857 1370
Contribution Count by City
Top 20 Cities by Number of Supports
FALSE
FALSE LOS ANGELES SAN FRANCISCO SAN DIEGO OAKLAND SAN JOSE
FALSE 40072 37034 19257 13507 13246
FALSE BERKELEY SACRAMENTO SANTA MONICA LONG BEACH SANTA CRUZ
FALSE 10469 9207 5819 5758 5448
FALSE SANTA BARBARA SANTA ROSA PASADENA PALO ALTO FRESNO
FALSE 5404 5311 4862 4727 3920
FALSE WALNUT CREEK IRVINE BAKERSFIELD DAVIS SUNNYVALE
FALSE 3521 3392 3119 3079 3008
Contribution Count by Occupation
Top 10 Occupations by Number of Supports
FALSE
FALSE NOT EMPLOYED RETIRED ATTORNEY
FALSE 83733 81385 12342
FALSE TEACHER ENGINEER SOFTWARE ENGINEER
FALSE 12126 8500 8106
FALSE HOMEMAKER PHYSICIAN INFORMATION REQUESTED
FALSE 6642 6505 5403
FALSE CONSULTANT
FALSE 5237
Contribution Count Over Time
Here we examine the supports made after 2015 
FALSE Min. 1st Qu. Median Mean 3rd Qu.
FALSE "2013-11-05" "2015-12-31" "2016-02-29" "2016-02-01" "2016-03-31"
FALSE Max.
FALSE "2016-04-30"
Number of Supports (Count) by Employer
The top ten employers by number of supports. As this variable is quite unclear and contains too many categories, I did not carry it over into further analysis.
FALSE
FALSE RETIRED
FALSE 57203
FALSE NONE
FALSE 50067
FALSE NOT EMPLOYED
FALSE 48536
FALSE N/A
FALSE 31263
FALSE SELF
FALSE 29351
FALSE SELF EMPLOYED
FALSE 28478
FALSE SELF-EMPLOYED
FALSE 26126
FALSE INFORMATION REQUESTED
FALSE 5308
FALSE INFORMATION REQUESTED PER BEST EFFORTS
FALSE 4056
FALSE
FALSE 3389
Bivariate Plots Section
Top candidates by total contribution
Who are the 10 most popular candidates? How much did they receive in contributions? I created a new table of the donations received per candidate, called receipt_by_candidate, and plotted the proportion of support count and the proportion of total contribution by candidate. We found that Hillary is the candidate who received the most funds, while Sanders gained the largest number of individual supports. Note: since the bars in these charts are colored by a third variable, the two charts are actually multivariate plots, but I place them here to make the report flow better.
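A per-candidate table of this kind can be sketched as below. The column names `count`, `total`, `count_share` and `total_share` are hypothetical (the report does not list the columns of receipt_by_candidate), and the amounts are toy values chosen only to show how count share and money share can diverge.

```r
# Toy transactions (illustrative amounts, not taken from the dataset).
contrib <- data.frame(
  cand_name = c("Sanders, Bernard", "Sanders, Bernard", "Sanders, Bernard",
                "Clinton, Hillary Rodham", "Sanders, Bernard"),
  contb_receipt_amt = c(10, 20, 30, 540, -15),
  stringsAsFactors = FALSE
)

# Use only the positive contributions.
positive <- contrib[contrib$contb_receipt_amt > 0, ]

# Number of contributions and total amount per candidate.
n     <- tapply(positive$contb_receipt_amt, positive$cand_name, length)
total <- tapply(positive$contb_receipt_amt, positive$cand_name, sum)

receipt_by_candidate <- data.frame(
  cand_name = names(n),
  count     = as.vector(n),
  total     = as.vector(total),
  stringsAsFactors = FALSE
)

# Shares of all contributions and of all money raised.
receipt_by_candidate$count_share <-
  receipt_by_candidate$count / sum(receipt_by_candidate$count)
receipt_by_candidate$total_share <-
  receipt_by_candidate$total / sum(receipt_by_candidate$total)
```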


Total contribution and contribution distribution by party
We plot the count of support and the distribution of contribution amount by party.
The average donation to the Democratic party is far less than that made to the Republican party.

The contribution distribution of the top ten candidates
Plotting the contribution distribution by candidate for the top ten, I find that the contributions made to Democratic candidates are relatively small. Bernard enjoys wide popularity, but most of his contributions are less than 50 dollars.

Top 10 cities by total contribution
We first select the cities where the number of transactions is over 100, then plot the top 10 of them by total contribution.
LA and San Francisco are the cities that contribute the most.

- Top 10 cities by mean contribution
In response to the question "Where do the rich donors live?", we look into the cities with the highest average contribution and over 100 transactions. We found limited overlap between the top cities by total contribution and the rich cities with high average contribution. Rich communities did not dominate CA presidential contributions.
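The two city rankings can be sketched as follows. The city names and amounts are made up, and the threshold `min_n` is set to 2 for the toy data, where the report uses 100.

```r
# Toy transactions; "CITY A/B/C" are hypothetical placeholders.
contrib <- data.frame(
  contrb_city = c(rep("CITY A", 4), rep("CITY B", 3), "CITY C"),
  contb_receipt_amt = c(25, 25, 25, 25, 30, 30, 30, 2700),
  stringsAsFactors = FALSE
)

min_n <- 2  # the report uses 100 transactions as the cutoff

# Transaction count and total amount per city.
n     <- tapply(contrib$contb_receipt_amt, contrib$contrb_city, length)
total <- tapply(contrib$contb_receipt_amt, contrib$contrb_city, sum)

city_summary <- data.frame(
  city  = names(n),
  n     = as.vector(n),
  total = as.vector(total),
  mean  = as.vector(total / n),
  stringsAsFactors = FALSE
)

# Drop cities with too few transactions, then rank the rest two ways.
eligible  <- city_summary[city_summary$n > min_n, ]
top_total <- head(eligible[order(-eligible$total), ], 10)
top_mean  <- head(eligible[order(-eligible$mean), ], 10)
```

The count filter keeps a city with a single very large donation (CITY C here) from topping the mean ranking.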

Top 10 occupations by total contribution
- The retired and the unemployed contribute the most to presidential campaigns. Homemakers are also a very important political force, since their average donation is even higher than that of attorneys, second only to CEOs and presidents.

- Homemakers, attorneys and IT workers rank at the top by mean contribution

Total Contribution by Zipcode
The top ten neighborhoods by total contribution

Contribution Counts and Total Over Time by Party
- The Democratic party dominates in terms of support counts, especially in 2016.
- In terms of total amount, the contributions in 2015 are similar between the two parties, but from 2016 on the Democratic party has a significant edge. This might be caused by the unexpected rise in popularity of Donald Trump and the dropouts of other Republican candidates.
Note: the plot of contributions over time is a multivariate plot, but I placed it here for ease of comparison.
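Counts and totals over time by party can be tabulated per month before plotting. The sketch below uses toy rows; the monthly granularity, the party labels and the cutoff date of 2015-01-01 are my assumptions about how "after 2015" was applied.

```r
# Toy transactions with already-parsed dates (illustrative values only).
contrib <- data.frame(
  contb_receipt_DT = as.Date(c("2013-11-05", "2015-06-10", "2015-06-20",
                               "2016-01-05", "2016-01-15", "2016-01-25")),
  party = c("democratic", "democratic", "republican",
            "democratic", "democratic", "republican"),
  contb_receipt_amt = c(100, 25, 500, 27, 50, 100),
  stringsAsFactors = FALSE
)

# Keep contributions from 2015 onward (assumed cutoff).
recent <- contrib[contrib$contb_receipt_DT >= as.Date("2015-01-01"), ]

# Label each transaction with its year-month.
recent$month <- format(recent$contb_receipt_DT, "%Y-%m")

# Month x party matrices: number of contributions and total amount.
count_by_month <- table(recent$month, recent$party)
total_by_month <- tapply(recent$contb_receipt_amt,
                         list(recent$month, recent$party), sum)
```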


Bivariate Analysis
1. Significant popularity of Democratic Candidates
Democratic supporters dominate California in terms of both support count and total contribution. However, the number of supports and the total funds received by a candidate are not necessarily positively related in California.
- Bernard and Hillary: more than half of the support counts are for Sanders, but Clinton raised the most contributions, twice as much as Sanders. The mean donation per individual varies a lot by candidate.
2. Key supportive forces by occupation.
The breakdown of supporters by occupation is interesting. The retired make up the largest supportive force. The not-employed are the second largest group in presidential campaign donations, which is quite unexpected. Homemakers also have very strong power in the political landscape of California, as they contribute the fourth largest amount of money to candidates, with a very high average. It would be interesting to see whom they support.
3. Number of Contributions over time by party
Democratic candidates receive an increasing number of contributions over time, while support for Republican candidates has declined since 2016. This might be due to the dropout of many Republican candidates and the unexpected rise of Trump.
Multivariate Plots Section
Top candidates’ contribution composition
As the Democratic party has significant popularity in California, and Hillary and Bernard are the two most popular candidates, it might be interesting to create a new variable called cand_name2 that includes Hillary, Bernard and the Republican candidates.
As shown in the following plot, Bernard received support mostly from small contributions, while Hillary seems to be popular across different groups. She also managed to receive the bulk of the big contributions, each of which is over 1000 USD.
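The regrouping into cand_name2 and the composition by contribution category can be sketched as below. The group labels "Bernard", "Hillary" and "Republicans" are assumed, as is the choice to leave all other candidates as NA; the rows are toy data.

```r
# Toy rows with the party and category variables already created.
contrib <- data.frame(
  cand_name = c("Sanders, Bernard", "Sanders, Bernard",
                "Clinton, Hillary Rodham", "Clinton, Hillary Rodham",
                "Cruz, Rafael Edward 'Ted'"),
  party = c("democratic", "democratic", "democratic", "democratic",
            "republican"),
  contrb_category = c("0-50", "0-50", "0-50", "more_than_1000", "51-200"),
  stringsAsFactors = FALSE
)

# Separate the two leading Democrats and pool all Republican candidates.
contrib$cand_name2 <- ifelse(contrib$cand_name == "Sanders, Bernard", "Bernard",
                      ifelse(contrib$cand_name == "Clinton, Hillary Rodham", "Hillary",
                      ifelse(contrib$party == "republican", "Republicans", NA)))

# Share of each contribution category within each group (rows sum to 1).
composition <- prop.table(table(contrib$cand_name2, contrib$contrb_category),
                          margin = 1)
```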

Support Pattern by Occupation
Retired people and homemakers are more likely to support Republican candidates. They are also more likely to support Hillary than Bernard. The not-employed only support Democratic candidates, and most of them support Bernard.

Support pattern in top and bottom cities in terms of contribution average
Both high-income and low-income communities have diverse political voices. However, high-income communities are more likely to support Republican candidates or Hillary. Sanders has larger support in the poor communities.


Geospatial Analysis
Even though we analyzed the support pattern in rich and poor cities, it is still hard to grasp the full picture statewide. This part explores the geographic pattern of candidate support. We create a new dataset, Bernard_index, that records the support rate of Bernard, Hillary and the Republican candidates as well as the total contribution in each zipcode area. I linked this table with the geographic information table of California. Two maps are created:
- The Bernard_index map shows the support rate of Bernard (# of Bernard contributions / total # of contributions) in each zipcode. The bluer the area, the higher the support rate of Bernard.
- The Bernard_index and contribution size map plots a point for each zipcode, where the size indicates the total contributions and the color indicates the Bernard support rate. The bluer the point, the higher the support rate.
From these two maps we can see that support for Bernard is overwhelming in California, especially in Northern California and the Bay Area.
Create a new table called zip_summary which includes a Bernard_index
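A per-zipcode summary with a Bernard index (Bernard contributions divided by all contributions in the zipcode) can be sketched as below. The column names `n` and `total` are hypothetical and the rows are toy data; joining the result to the zipcode shapefile is not shown.

```r
# Toy transactions with five-digit zips (illustrative values only).
contrib <- data.frame(
  contrb_zip = c("94110", "94110", "94110", "90046", "90046"),
  cand_name  = c("Sanders, Bernard", "Sanders, Bernard",
                 "Clinton, Hillary Rodham", "Clinton, Hillary Rodham",
                 "Sanders, Bernard"),
  contb_receipt_amt = c(10, 20, 500, 1000, 15),
  stringsAsFactors = FALSE
)

# Flag the contributions that went to Bernard.
is_bernard <- contrib$cand_name == "Sanders, Bernard"

# Per-zip number of contributions, Bernard contributions, and total amount.
n_total   <- tapply(is_bernard, contrib$contrb_zip, length)
n_bernard <- tapply(is_bernard, contrib$contrb_zip, sum)
total_amt <- tapply(contrib$contb_receipt_amt, contrib$contrb_zip, sum)

zip_summary <- data.frame(
  zip           = names(n_total),
  n             = as.vector(n_total),
  total         = as.vector(total_amt),
  Bernard_index = as.vector(n_bernard / n_total),
  stringsAsFactors = FALSE
)
```

The same pattern with a different flag would give the support rates for Hillary and for the Republican candidates.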
Plot the Bernard Index by polygon - The Bernard_index map
FALSE OGR data source with driver: ESRI Shapefile
FALSE Source: "cb_2014_us_zcta510_500k", layer: "cb_2014_us_zcta510_500k"
FALSE with 33144 features
FALSE It has 5 fields

Plot the Bernard Index by point - The Bernard_index and contribution size map
[Plot: The Bernard_index and contribution size map in California]

Multivariate Analysis
Candidates’ popularity varies by geography and occupation
The multivariate analysis confirms again that candidates seem to have different popularity in different income groups.
- The affluent are more likely to support the Republican candidates, and when they support a Democratic candidate, they are more likely to support Hillary. This is reflected in the support pattern by geography.
- Retired people, homemakers and attorneys are more likely to support Republican candidates or Hillary, while the unemployed are more likely to support Bernard.
- Spatial analysis also shows that Bernard is more popular in Northern California, including the Bay Area, while people in Southern California are more likely to support others.
Classification Model
It is also possible to construct a classification model (using logistic regression, random forest, gradient boosting, SVM, or a neural network) to predict the candidate based on the amount of contribution, the zipcode, and the contributor's occupation. I did not try any model here, but would be interested in exploring these options later.
Key Plots and Summary
1. Candidates by Popularity
Democratic candidates have significant popularity in California: Bernard received more than half of the total support count and Hillary received about 25% of it.

2. Top Candidates’ total contribution and contribution composition
Hillary is the candidate who received the largest total contribution. Despite Bernard's wide popularity, his total contribution is only about half of Hillary's, since Bernard received support mostly from small contributions. Hillary seems to be popular across different groups. She also managed to receive the bulk of the big contributions that are over 1000 USD each.

3. Support Pattern by Region
Most contributions are concentrated in the metropolitan areas of California, especially the Bay Area and the LA metropolis. Although support for Bernard is overwhelming across California, his popularity is more significant in Northern California than in Central and Southern California.

Reflection
Summary
This analysis examines the support pattern in California. We found that Democratic candidates have wide popularity in California, especially Bernard and Hillary. Although the number of Republican candidates is large, their combined support is still less than that of either Bernard or Hillary.
Examining the support pattern by candidate within the Democratic party, we found that Bernard is more popular in Northern California and among low-income contributors. Hillary is better received in the upper class and in high-income neighborhoods. Candidate support within a party differs by supporters' income level, zipcode, and occupation.
Struggle
In my exploratory analysis, my biggest struggles were with two things:
- How to subset the data. The geographic variables like zipcode and city have so many levels, and readers can only process a limited number of them in a plot, so I struggled with whether to choose 5, 10 or 20 of them to showcase the relationship. In the end, I chose 10, which is informative without being overwhelming.
- How to recategorize the data. I first created the variable party, but later found that it is too coarse a category, while cand_name is too fine. In the end, I created a category that separates Hillary from Bernard and puts them at the same level as the Republican candidates. Although this seems illogical, I found that the category helps tell much of the story. I also feel the contribution amount category is very helpful and connects the whole story together, as it clearly marks out the difference between Hillary's and Bernard's supporters.
Future Works
As California has long been known as a Democratic state, it would be interesting to compare its pattern of presidential contributions to the national average or to other states. It is also possible to construct a classification model to predict the candidate based on the amount of contribution, the zipcode, and the contributor's occupation.